Using Tranformers Models to Score Splice-Junctions: Why, Which and How¶
Fine-Tuning DNAbert1, DNAbert2 and NT¶
Setting up the environment¶
conda create -n env_dnabert2 python=3.8 conda activate env_dnabert2 git clone https://github.com/Zhihan1996/DNABERT_2.git cd DNABERT_2/ python3 -m pip install -r requirements.txt python3 -m pip install scikit-learn pip uninstall triton
DGX Setup:¶
sudo docker pull nvcr.io/nvidia/pytorch:23.08-py3 sudo docker run --gpus all -it --rm -v local_dir:/raid/Fadel/SpliceUp/ nvcr.io/nvidia/pytorch:23.08-py3 test123! docker run --gpus all -it --rm -v /raid/Fadel/SpliceUp:/SpliceUp nvcr.io/nvidia/pytorch:23.08-py3 test123!
Fine-Tuning using GUE dataset¶
https://drive.google.com/file/d/1GRtbzTe3UXYF1oW27ASNhYX3SZ16D7N2/view
bash /data/SpliceUp/scripts/tune.sh /data/SpliceUp/datasets/GUE/splice/reconstructed/ /data/SpliceUp/tuned_models/ GUE 400
datasets = "/raid/Fadel/SpliceUp/datasets/RefSeq200/"
# Collecting all splice junctions even if they dont follow our stict rules, these junction will be
# used to mask the genome with Ns in order to generat fake junction from the rest of the genome
# example for window=6
# Genome |--------------------------- exon -----------------| |---------- exon ----------|
# Genome after masking |--------------------------------------------NNNNNNNNNNNN NNNNNNNNNNNN---------------------|
Generating RefSeq datasets¶
# Define file paths
path_fasta = "/data/references/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
path_gtf = "/data/references/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz"
output_path = "/data/DeepSAP/datasets/"
pickles_path = "/data/DeepSAP/pickles/"
dataset = "RefSeq"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
for window in [90, 150, 200, 400]:
# Collect strictly annotated splice junctions (for training)
strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Collect all splice junctions (for masking genome)
not_strict_annotated_junctions_table = {} # Dictionary to store non-strict junctions
# Define number of synthetic junctions to create
negative_dataset_factor = 1.0
n_made_up_junctions = negative_dataset_factor * len(strict_annotated_junctions_table.keys())
# Generate synthetic junction sequences from non-strict junctions
made_up_junctions_table = {} # Dictionary to store made-up junctions
# Write tuning dataset files (train, eval, test) to output directory
path = output_path + dataset + str(window)
# Write dataset for prediction (dev set) after fine-tuning
path = output_path + dataset + str(window) + "_PREDICT"
# Save strict junctions table as a pickle file
path = pickles_path + "strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
# Save non-strict junctions table as a pickle file
path = pickles_path + "not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
Generating GENCODE datasets¶
# Define file paths
path_fasta = "/data/references/GRCh38_v44/GRCh38.primary_assembly.genome.fa.gz"
path_gtf = "/data/references/GRCh38_v44/gencode.v44.primary_assembly.annotation.gtf.gz"
output_path = "/data/DeepSAP/datasets/"
pickles_path = "/data/DeepSAP/pickles/"
dataset = "GENCODE"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
for window in [90, 150, 200, 400]:
# Collect strictly annotated splice junctions (for training)
strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Collect all splice junctions (for masking genome)
not_strict_annotated_junctions_table = {} # Dictionary to store non-strict junctions
# Define number of synthetic junctions to create
negative_dataset_factor = 1.0
n_made_up_junctions = (negative_dataset_factor + 0.5) * len(strict_annotated_junctions_table.keys())
# Generate synthetic junction sequences from non-strict junctions
made_up_junctions_table = {} # Dictionary to store made-up junctions
# Write tuning dataset files (train, eval, test) to output directory
path = output_path + dataset + str(window)
# Write dataset for prediction (dev set) after fine-tuning
path = output_path + dataset + str(window) + "_PREDICT"
# Save strict junctions table as a pickle file
path = pickles_path + "strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
# Save non-strict junctions table as a pickle file
path = pickles_path + "not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
Generating MS150 dataset¶
Homo_sapiens (RefSeq)¶
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz https://ftp.ncbi.nlm.nih.gov/genomes/refseq/vertebrate_mammalian/Homo_sapiens/reference/GCF_000001405.40_GRCh38.p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz
Homo_sapiens (GENCODE)¶
https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/GRCh38.primary_assembly.genome.fa.gz https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_44/gencode.v44.primary_assembly.annotation.gtf.gz
Danio rerio¶
https://ftp.ensembl.org/pub/release-110/fasta/danio_rerio/dna/Danio_rerio.GRCz11.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/danio_rerio/Danio_rerio.GRCz11.110.gtf.gz
Mus_musculus:¶
https://ftp.ensembl.org/pub/release-110/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/mus_musculus/Mus_musculus.GRCm39.110.gtf.gz
Drosophila_melanogaster¶
https://ftp.ensembl.org/pub/release-110/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.46.110.gtf.gz
Caenorhabditis_elegans¶
https://ftp.ensembl.org/pub/release-110/fasta/caenorhabditis_elegans/dna/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/caenorhabditis_elegans/Caenorhabditis_elegans.WBcel235.110.gtf.gz
Rattus norvegicus¶
https://ftp.ensembl.org/pub/release-110/fasta/rattus_norvegicus/dna/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz https://ftp.ensembl.org/pub/release-110/gtf/rattus_norvegicus/Rattus_norvegicus.mRatBN7.2.110.gtf.gz
Plasmodium falciparum¶
http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_falciparum/dna/Plasmodium_falciparum.ASM276v2.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_falciparum/Plasmodium_falciparum.ASM276v2.57.gtf.gz
Plasmodium vivax¶
http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_vivax/dna/Plasmodium_vivax.ASM241v2.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_vivax/Plasmodium_vivax.ASM241v2.57.gtf.gz
Plasmodium berghei¶
http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_berghei/dna/Plasmodium_berghei.PBANKA01.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_berghei/Plasmodium_berghei.PBANKA01.57.gtf.gz
Plasmodium knowlesi¶
http://ftp.ensemblgenomes.org/pub/protists/release-57/fasta/plasmodium_knowlesi/dna/Plasmodium_knowlesi.ASM635v1.dna.toplevel.fa.gz http://ftp.ensemblgenomes.org/pub/protists/release-57/gtf/plasmodium_knowlesi/Plasmodium_knowlesi.ASM635v1.57.gtf.gz
# Define file paths
path_fasta = "/Users/zingo/Downloads/danio/Danio_rerio.GRCz11.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/danio/Danio_rerio.GRCz11.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "DANIO"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
DANIO_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCm39/Mus_musculus.GRCm39.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/GRCm39/Mus_musculus.GRCm39.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "MUS"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
MUS_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Define file paths
path_fasta = "/Users/zingo/Downloads/droso/Drosophila_melanogaster.BDGP6.46.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/droso/Drosophila_melanogaster.BDGP6.46.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "DROSO"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
DROSO_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Define file paths
path_fasta = "/Users/zingo/Downloads/elegans/Caenorhabditis_elegans.WBcel235.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/elegans/Caenorhabditis_elegans.WBcel235.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "ELEGANS"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
ELEGANS_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Define file paths
path_fasta = "/Users/zingo/Downloads/rattus/Rattus_norvegicus.mRatBN7.2.dna.toplevel.fa.gz"
path_gtf = "/Users/zingo/Downloads/rattus/Rattus_norvegicus.mRatBN7.2.110.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "RATTUS"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
RATTUS_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCh38_v44/GRCh38.primary_assembly.genome.fa.gz"
path_gtf = "/Users/zingo/Downloads/GRCh38_v44/gencode.v44.primary_assembly.annotation.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "GENCODE"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
GENCODE_strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# List of datasets and their corresponding variable names
datasets = {
"GENCODE": GENCODE_strict_annotated_junctions_table,
"DANIO": DANIO_strict_annotated_junctions_table,
"MUS": MUS_strict_annotated_junctions_table,
"DROSO": DROSO_strict_annotated_junctions_table,
"ELEGANS": ELEGANS_strict_annotated_junctions_table,
"RATTUS": RATTUS_strict_annotated_junctions_table
}
# Save each dataset as a pickle file
for species, junctions_table in datasets.items():
path = os.path.join(pickles_path, f"{species}_strict_annotated_junctions_table__{dataset}__{window}.pkl")
with open(path, 'wb') as handle:
pickle.dump(junctions_table, handle, protocol=pickle.HIGHEST_PROTOCOL)
# Define file paths
path_fasta = "/Users/zingo/Downloads/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.fna.gz"
path_gtf = "/Users/zingo/Downloads/GRCh38_RefSeq_p14/GCF_000001405.40_GRCh38.p14_genomic.gtf.gz"
output_path = "/Users/zingo/Downloads/datasets/"
pickles_path = "/Users/zingo/Downloads/pickles/"
dataset = "MultiSpecies"
max_n_junctions = 500000 # No capping
# Read FASTA sequences into genome_sequences_table
genome_sequences_table = {} # Dictionary to store genome sequences
# Read GTF file and extract gene, transcript, and exon information
genes_table = {} # Dictionary to store gene annotations
transcripts_table = {} # Dictionary to store transcript annotations
exons_table = {} # Dictionary to store exon annotations
# Define window size
window = 150
# Collect strictly annotated splice junctions (for training)
strict_annotated_junctions_table = {} # Dictionary to store strict junctions
# Add all other splice junctions from different species
for junction_db in junction_databases:
strict_annotated_junctions_table.update(junction_db)
# Collect all splice junctions (for masking genome)
not_strict_annotated_junctions_table = {} # Dictionary to store non-strict junctions
# Define number of synthetic junctions to create
negative_dataset_factor = 1.0
n_made_up_junctions = negative_dataset_factor * len(strict_annotated_junctions_table.keys())
# Generate synthetic junction sequences from non-strict junctions
made_up_junctions_table = {} # Dictionary to store made-up junctions
# Write tuning dataset files (train, eval, test) to output directory
path = output_path + dataset + str(window)
# Write dataset for prediction (dev set) after fine-tuning
path = output_path + dataset + str(window) + "_TRUE"
# Save strict junctions table as a pickle file
path = pickles_path + "Multi_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
# Save non-strict junctions table as a pickle file
path = pickles_path + "Multi_not_strict_annotated_junctions_table__" + dataset + "__" + str(window) + ".pkl"
print("______________________________________________________________________________________________________________________________\n")
Plotting evaluation metrics:¶
from utility_transformers import parse_tuned, plot_tuning, plot_tuning_metric_per_window
path = "/data/SpliceUp/tuned_models/"
df = parse_tuned(path, level="eval")
plot_tuning(df)
# Save the plots
for window in [90, 200, 400, 150]:
filtered_df = df.loc[df['Window'] == window]
plot_tuning_metric_per_window(filtered_df, "/data/SpliceUp/DeepSAP-main/scripts/plots/tuning/tuning_" + f"{window}", window)
Processing DNABERT1_k6__Splice__GENCODE150 Processing DNABERT1_k6__Splice__GENCODE400 Processing DNABERT1_k6__Splice__MultiSpecies150 Processing DNABERT1_k6__Splice__PF_r57_90 Processing DNABERT1_k6__Splice__PF_r57_150 Processing DNABERT1_k6__Splice__PF_r57_200 Processing DNABERT1_k6__Splice__PF_r57_400 Processing DNABERT1_k6__Splice__RefSeq90 Processing DNABERT1_k6__Splice__RefSeq150 Processing DNABERT1_k6__Splice__RefSeq200 Processing DNABERT1_k6__Splice__RefSeq400 Processing DNABERT2__Splice__GENCODE150 Processing DNABERT2__Splice__GENCODE400 Processing DNABERT2__Splice__PF_r57_90 Processing DNABERT2__Splice__PF_r57_150 Processing DNABERT2__Splice__PF_r57_200 Processing DNABERT2__Splice__PF_r57_400 Processing DNABERT2__Splice__RefSeq90 Processing DNABERT2__Splice__RefSeq150 Processing DNABERT2__Splice__RefSeq200 Processing DNABERT2__Splice__RefSeq400 Processing NT_2.5B_850g_multi_species__Splice__RefSeq90 Processing NT_2.5B_850g_multi_species__Splice__RefSeq90_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq150 Processing NT_2.5B_850g_multi_species__Splice__RefSeq150_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq200 Processing NT_2.5B_850g_multi_species__Splice__RefSeq200_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq400 Processing NT_2.5B_850g_multi_species__Splice__RefSeq400_wot_LoRA
from utility_transformers import parse_tuned, plot_evaluating
path = "/data/SpliceUp/tuned_models/"
df = parse_tuned(path, level="test")
plot_evaluating(df, "/data/SpliceUp/DeepSAP-main/scripts/plots/evaluation/evaluating", metric='MCC')
df.to_csv("/data/SpliceUp/DeepSAP-main/scripts/plots/evaluation/evaluation.csv")
Processing DNABERT1_k6__Splice__GENCODE150 Processing DNABERT1_k6__Splice__GENCODE400 Processing DNABERT1_k6__Splice__MultiSpecies150 Processing DNABERT1_k6__Splice__PF_r57_90 Processing DNABERT1_k6__Splice__PF_r57_150 Processing DNABERT1_k6__Splice__PF_r57_200 Processing DNABERT1_k6__Splice__PF_r57_400 Processing DNABERT1_k6__Splice__RefSeq90 Processing DNABERT1_k6__Splice__RefSeq150 Processing DNABERT1_k6__Splice__RefSeq200 Processing DNABERT1_k6__Splice__RefSeq400 Processing DNABERT2__Splice__GENCODE150 Processing DNABERT2__Splice__GENCODE400 Processing DNABERT2__Splice__PF_r57_90 Processing DNABERT2__Splice__PF_r57_150 Processing DNABERT2__Splice__PF_r57_200 Processing DNABERT2__Splice__PF_r57_400 Processing DNABERT2__Splice__RefSeq90 Processing DNABERT2__Splice__RefSeq150 Processing DNABERT2__Splice__RefSeq200 Processing DNABERT2__Splice__RefSeq400 Processing NT_2.5B_850g_multi_species__Splice__RefSeq90 Processing NT_2.5B_850g_multi_species__Splice__RefSeq90_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq150 Processing NT_2.5B_850g_multi_species__Splice__RefSeq150_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq200 Processing NT_2.5B_850g_multi_species__Splice__RefSeq200_wot_LoRA Processing NT_2.5B_850g_multi_species__Splice__RefSeq400 Processing NT_2.5B_850g_multi_species__Splice__RefSeq400_wot_LoRA